Defining and Measuring Supercomputer Reliability, Availability, and Serviceability (RAS)
Abstract
The absence of agreed definitions and metrics for supercomputer RAS obscures meaningful discussion of the issues involved and hinders their solution. This paper seeks to foster a common basis for communication about supercomputer RAS by proposing a system state model, definitions, and measurements. These are modeled after the SEMI-E10 specification [1], which is widely used in the semiconductor manufacturing industry.
Similar Resources
Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems
This paper presents various aspects of reliability, availability and serviceability (RAS) systems as they relate to group communication service, including reliable and total order multicast/broadcast, virtual synchrony, and failure detection. While the issue of availability, particularly high availability using replication-based architectures, has recently received a surge of research interest, mu...
Towards a Specification for Measuring Red Storm Reliability, Availability, and Serviceability (RAS)
The absence of agreed definitions and metrics for supercomputer RAS obscures meaningful discussion of the issues involved, hinders their solution, and increases total system cost. Seeking to foster a common basis for communication about supercomputer RAS, [1] proposed a general system state model, definitions, and measurements based on the SEMI-E10 specification [2] used in the semiconductor ma...
Reliability, availability, and serviceability (RAS) of the IBM eServer z990
M. L. Fair, C. R. Conklin, S. B. Swaney, P. J. Meaney, W. J. Clarke, L. C. Alves, I. N. Modi, F. Freier, W. Fischer, N. E. Weber. The IBM eServer zSeries Model z990 offers customers significant new opportunity for server growth while preserving and enhancing server availability. The z990 provides vertical growth capability by introducing the concurrent additio...
The 7U Evaluation Method: Evaluating Software Systems via Runtime Fault-Injection and Reliability, Availability and Serviceability (RAS) Metrics and Models
Measuring Fault Tolerance Overhead in Multi-Run Scientific Computations
Knowing the beneficial or productive usage time for large high performance computing (HPC) platforms is important for computing metrics that capture the reliability, availability, and serviceability (RAS) of the platform. Currently, application execution time is generally accounted as all productive system time, yet large-scale, long-running applications incur fault tolerance overhead such as c...
Publication date: 2005